An Efficient Minimum Vocabulary Construction Algorithm for Language Modeling

Authors

  • Sina Lin
  • Zengchang Qin
  • Zehua Huang
  • Tao Wan
Abstract

To learn a new word from a dictionary, we first need to know a set of “basic words” that appear frequently in word definitions. It often happens that you cannot understand the word you looked up because its definition or explanation in the dictionary contains words you do not understand. You can keep looking up these new words recursively until they can all be explained by basic words you already know. This paper discusses how to automatically find a minimum set of such basic words that defines (or recursively defines) the entire vocabulary of a given dictionary. We propose an efficient algorithm that constructs the Minimum Vocabulary (MV) using word frequency information. The minimum vocabulary can be used for language modeling, and experimental results demonstrate the effectiveness of using the minimum vocabulary as features in text classification.
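
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the problem it describes, not the authors' method. It assumes a dictionary is given as a mapping from each word to the set of words in its definition. The names `learnable`, `greedy_basic_words`, and `toy` are hypothetical. The greedy, frequency-ordered selection is an assumed heuristic and is not guaranteed to return a true minimum set.

```python
from collections import Counter


def learnable(definitions, basic):
    """Return every word understandable starting from the `basic` words.

    A word becomes known once all words in its definition are known;
    this is repeated until no new word can be added (a fixed point).
    """
    known = set(basic)
    changed = True
    while changed:
        changed = False
        for word, defn in definitions.items():
            if word not in known and defn <= known:
                known.add(word)
                changed = True
    return known


def greedy_basic_words(definitions):
    """Assumed heuristic: add the most frequent definition words first.

    Words are tried in order of how often they occur in definitions
    (ties broken alphabetically). A word is added to the basic set only
    if it is not already learnable from the words chosen so far.
    """
    freq = Counter(w for defn in definitions.values() for w in defn)
    vocab = set(definitions) | set(freq)
    basic = set()
    known = learnable(definitions, basic)
    for word in sorted(freq, key=lambda w: (-freq[w], w)):
        if known >= vocab:
            break
        if word not in known:
            basic.add(word)
            known = learnable(definitions, basic)
    return basic


# Toy dictionary (invented for illustration): "big" and "large" define
# each other, and "very" has no entry of its own.
toy = {
    "big": {"large"},
    "large": {"big"},
    "huge": {"very", "big"},
    "enormous": {"very", "large"},
}

basic_words = greedy_basic_words(toy)
covered = learnable(toy, basic_words)
```

In this toy case the circular pair "big"/"large" forces one of the two into the basic set, and "very" must be included because the dictionary never defines it.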

Similar resources

Unlimited vocabulary speech recognition based on morphs discovered in an unsupervised manner

We study continuous speech recognition based on sub-word units found in an unsupervised fashion. For agglutinative languages like Finnish, traditional word-based n-gram language modeling does not work well due to the huge number of different word forms. We use a method based on the Minimum Description Length principle to split words statistically into subword units allowing efficient language m...

Full expansion of context-dependent networks in large vocabulary speech recognition

We combine our earlier approach to context-dependent network representation with our algorithm for determinizing weighted networks to build optimized networks for large-vocabulary speech recognition combining an n-gram language model, a pronunciation dictionary and context-dependency modeling. While fully-expanded networks have been used before in restrictive settings (medium vocabulary or no cr...

Integrated modeling and solving the resource allocation problem and task scheduling in the cloud computing environment

Cloud computing is considered to be a new service provider technology for users and businesses. However, the cloud environment is facing a number of challenges. Resource allocation in a way that is optimum for users and cloud providers is difficult because of lack of data sharing between them. On the other hand, job scheduling is a basic issue and at the same time a big challenge in reaching hi...

Life-wise Language Learning Textbooks: Construction and Validation of an Emotional Abilities Scale through Rasch Modeling

Underlying the recently developed notions of applied ELT and life syllabus is the idea that language classes should give precedence to learners’ life qualities, for instance emotional intelligence (EI), over and above their language skills. By so doing, ELT is ascribed an autonomous status and ELT classes can lavish their full potentials to the learners. With that in mind, this study aimed to d...

Data driven subword unit modeling for speech recognition and its application to interactive reading tutors

This paper proposes a novel token-passing search architecture for supporting subword unit based speech recognition and a corresponding algorithm based on the well-known LZW text compression method to determine a vocabulary of subword units in an unsupervised manner. We compare our subword unit selection algorithm to an existing approach based on Minimum Description Length (MDL) modeling and als...

Publication date: 2012